Abstract


We give new algorithms for learning halfspaces in the challenging malicious noise model, where
an adversary may corrupt both the labels and the underlying distribution of examples. Our
algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the
dependence on the dimension $n$, and succeed for the fairly broad class of all isotropic log-concave
distributions.

We give $\mathrm{poly}(n, 1/\epsilon)$-time algorithms for solving the following problems to accuracy $\epsilon$:

• Learning origin-centered halfspaces in $\mathbb{R}^n$ with respect to the uniform
distribution on the unit ball with malicious noise rate $\eta = \Omega(\epsilon^2/\log(n/\epsilon))$.
(The best previous result was $\Omega(\epsilon/(n\log(n/\epsilon))^{1/4})$.)

• Learning origin-centered halfspaces with respect to any isotropic log-concave
distribution on $\mathbb{R}^n$ with malicious noise rate $\eta = \Omega(\epsilon^3/\log^2(n/\epsilon))$.
This is the first efficient algorithm for learning under isotropic log-concave
distributions in the presence of malicious noise.

We also give a $\mathrm{poly}(n, 1/\epsilon)$-time algorithm for learning origin-centered halfspaces under any
isotropic log-concave distribution on $\mathbb{R}^n$ in the presence of adversarial label noise at rate
$\eta = \Omega(\epsilon^3/\log(1/\epsilon))$. In the adversarial label noise setting (or agnostic model), labels can be noisy,
but not example points themselves. Previous results could handle $\eta = \Omega(\epsilon)$ but had running time
exponential in an unspecified function of $1/\epsilon$.

Our analysis crucially exploits both concentration and anti-concentration properties of isotropic 
log-concave distributions. Our algorithms combine an iterative outlier removal procedure using 
Principal Component Analysis together with "smooth" boosting.


Keywords: PAC learning, noise tolerance, malicious noise, agnostic learning, label noise,
halfspace learning, linear classifiers

1. Introduction

A halfspace is a Boolean-valued function of the form $f = \mathrm{sign}(\sum_{i=1}^{n} w_i x_i - \theta)$. Learning halfspaces
in the presence of noisy data is a fundamental problem in machine learning. In addition to its
practical relevance, the problem has connections to many well-studied topics such as kernel
methods (Shawe-Taylor and Cristianini, 2000), cryptographic hardness of learning (Klivans and
Sherstov, 2006), hardness of approximation (Feldman et al., 2006; Guruswami and Raghavendra, 2006),
learning Boolean circuits (Blum et al., 1997), and additive/multiplicative update learning algorithms
(Littlestone, 1991; Freund and Schapire, 1999).

Learning an unknown halfspace from correctly labeled (non-noisy) examples is one of the best- 
understood problems in learning theory, with work dating back to the famous Perceptron algorithm 
of the 1950s (Rosenblatt, 1958) and a range of efficient algorithms known for different settings 
(Novikoff, 1962; Littlestone, 1987; Blumer et al., 1989; Maass and Turan, 1994). Much less is 
known, however, about the more difficult problem of learning halfspaces in the presence of noise. 

Important progress was made by Blum et al. (1997) who gave a polynomial-time algorithm for 
learning a halfspace under classification noise. In this model each label is flipped independently 
with some fixed probability; the noise does not affect the actual example points themselves, which 
are generated according to an arbitrary probability distribution over $\mathbb{R}^n$.

In the current paper we consider a much more challenging malicious noise model. In this model,
introduced by Valiant (1985) (see also Kearns and Li 1993), there is an unknown target function $f$
and distribution $D$ over examples. Each time the learner receives an example, independently with
probability $1 - \eta$ it is drawn from $D$ and labeled correctly according to $f$, but with probability $\eta$ it
is an arbitrary pair $(x, y)$ which may be generated by an omniscient adversary. The parameter $\eta$ is
known as the "noise rate."

Malicious noise is a notoriously difficult model with few positive results. It was already shown
by Kearns and Li (1993) that for essentially all concept classes, it is information-theoretically
impossible to learn to accuracy $1 - \epsilon$ if the noise rate $\eta$ is greater than $\epsilon/(1+\epsilon)$. Indeed, known
algorithms for learning halfspaces (Servedio, 2003; Kalai et al., 2008) or even simpler target
functions (Mansour and Parnas, 1998) with malicious noise typically make strong assumptions about
the underlying distribution $D$, and can learn to accuracy $1 - \epsilon$ only for noise rates $\eta$ much smaller
than $\epsilon$. We describe the most closely related work that we know of in Section 1.2.

In this paper we consider learning under the uniform distribution on the unit ball in $\mathbb{R}^n$, and
more generally under any isotropic log-concave distribution. The latter is a fairly broad class of
distributions that includes spherical Gaussians and uniform distributions over a wide range of convex
sets. Our algorithms can learn from malicious noise rates that are quite high, as we now describe.


1.1 Main Results 


Our first result is an algorithm for learning halfspaces in the malicious noise model with respect to 
the uniform distribution on the n-dimensional unit ball: 


Theorem 1 There is a $\mathrm{poly}(n, 1/\epsilon)$-time algorithm that learns origin-centered halfspaces to
accuracy $1 - \epsilon$ with respect to the uniform distribution on the unit ball in $n$ dimensions in the presence
of malicious noise at rate $\eta = \Omega(\epsilon^2/\log(n/\epsilon))$.

The condition on $\eta$ is expressed using $\Omega$ and not $O$ because we are showing that a weak upper
bound on the noise rate suffices to achieve accuracy $1 - \epsilon$.

Via a more sophisticated algorithm, we can learn in the presence of malicious noise under any
isotropic log-concave distribution:

Theorem 2 There is a $\mathrm{poly}(n, 1/\epsilon)$-time algorithm that learns origin-centered halfspaces to
accuracy $1 - \epsilon$ with respect to any isotropic log-concave distribution over $\mathbb{R}^n$ and can tolerate malicious
noise at rate $\eta = \Omega(\epsilon^3/\log^2(n/\epsilon))$.


We are not aware of any previous polynomial-time algorithms for learning under isotropic log- 
concave distributions in the presence of malicious noise. 

Finally, we also consider a related noise model known as adversarial label noise. In this model
there is a fixed probability distribution $P$ over $\mathbb{R}^n \times \{-1, 1\}$ (i.e., over labeled examples) for which
a $1 - \eta$ fraction of draws are labeled according to an unknown halfspace. The marginal distribution
over $\mathbb{R}^n$ is assumed to be isotropic log-concave; so the idea is that an "adversary" chooses an $\eta$
fraction of examples to mislabel, but unlike the malicious noise model she cannot change the (isotropic
log-concave) distribution of the actual example points in $\mathbb{R}^n$. Learning with adversarial label noise
is clearly harder than with independent misclassification noise: the ability to choose which labels
to corrupt allows the adversary to coordinate their effects to an extent.

For the adversarial label noise model we prove: 


Theorem 3 There is a $\mathrm{poly}(n, 1/\epsilon)$-time algorithm that learns origin-centered halfspaces to
accuracy $1 - \epsilon$ with respect to any isotropic log-concave distribution over $\mathbb{R}^n$ and can tolerate
adversarial label noise at rate $\eta = \Omega(\epsilon^3/\log(1/\epsilon))$.


1.2 Previous Work 


Our work builds on a number of lines of research. 


1.2.1 MALICIOUS NOISE 


General-purpose tools developed by Kearns and Li (1993) (see also Kearns et al. 1994) directly
imply that halfspaces can be learned for any distribution over the domain in randomized $\mathrm{poly}(n, 1/\epsilon)$
time with malicious noise at a rate $\Omega(\epsilon/n)$; the algorithm repeatedly picks a random subsample of
the training data, hoping to miss all the noisy examples. Kannan (see Arora et al. 1993) devised
a deterministic algorithm with an $\Omega(\epsilon/n)$ bound that repeatedly exploits Helly's Theorem to find
a group of $n + 1$ examples that includes a noisy example, then removes the group. Kalai et al.
(2008) showed that the $\mathrm{poly}(n, 1/\epsilon)$-time averaging algorithm (Servedio, 2001) tolerates noise at a
rate $\Omega(\epsilon/\sqrt{n})$ when the distribution is uniform. They also described an improvement to $\tilde{\Omega}(\epsilon/n^{1/4})$
based on the observation that uniform examples will tend to be well-separated, so that pairs of
examples that are too close to one another can be removed.


1.2.2 ADVERSARIAL LABEL NOISE 


Kalai et al. (2008) showed that if the distribution over the instances is uniform over the unit ball, the
averaging algorithm tolerates adversarial label noise at a rate $\Omega(\epsilon/\sqrt{\log(1/\epsilon)})$ in $\mathrm{poly}(n, 1/\epsilon)$ time.
(In that paper, learning in the presence of adversarial label noise was called "agnostic learning".)
They also described an algorithm that fits low-degree polynomials that tolerates noise at a rate within
an additive $\epsilon$ of the accuracy, but in $\mathrm{poly}(n^{1/\epsilon^4})$ time; for log-concave distributions, their algorithm
took $\mathrm{poly}(n^{d(1/\epsilon)})$ time, for an unspecified function $d$. The latter algorithm does not require that the
distribution is isotropic, as ours does.


1.2.3 ROBUST PCA


Independently of this work, Xu et al. (2009) designed and analyzed an algorithm that performs
principal component analysis when some of the examples are corrupted arbitrarily, as in the malicious
noise model studied here. Also, the thesis of Brubaker (2009) presents a "Robust PCA" algorithm
which is a PCA variant aimed at ameliorating the effects of noisy examples.


1.3 Techniques 


Here is a high-level description of the main techniques in our analysis. 


1.3.1 OUTLIER REMOVAL 


Consider first the simplest problem of learning an origin-centered halfspace with respect to the
uniform distribution on the $n$-dimensional ball. A natural idea is to use a simple "averaging" algorithm
that takes the vector average of the positive examples it receives and uses this as the normal vector
of its hypothesis halfspace. Servedio (2001) analyzed this algorithm for the random classification
noise model, and Kalai et al. (2008) extended the analysis to the adversarial label noise model.

Intuitively the "averaging" algorithm can only tolerate low malicious noise rates because the
adversary can generate noisy examples which "pull" the average vector far from its true location.
Our main insight is that the adversary does this most effectively when the noisy examples are
coordinated to pull in roughly the same direction. We use a form of outlier detection based on Principal
Component Analysis to detect such coordination. This is done by computing the direction $w$ of
maximal variance of the data set; if the variance in direction $w$ is suspiciously large, we remove
from the sample all points $x$ for which $(w \cdot x)^2$ is large. Our analysis shows that this causes many
noisy examples, and only a few non-noisy examples, to be removed.

We repeat this process until the variance in every direction is not too large. (This cannot take too 
many stages since many noisy examples are removed in each stage.) While some noisy examples 
may remain, we show that their scattered effects cannot hurt the algorithm much. 

Thus, in a nutshell, our overall algorithm for the uniform distribution is to first do outlier
removal¹ by an iterated PCA-type procedure, and then simply run the averaging algorithm on the
remaining "cleaned-up" data set.


1.3.2 EXTENDING TO LOG-CONCAVE DISTRIBUTIONS VIA SMOOTH BOOSTING 


We are able to show that the iterative outlier removal procedure described above is useful for
isotropic log-concave distributions as well as the uniform distribution: if examples are removed
in a given stage, then many of the removed examples are noisy and only a few are non-noisy (the
analysis here uses concentration bounds for isotropic log-concave distributions). However, even if
there were no noise in the data, the average of the positive examples under an isotropic log-concave
distribution need not give a high-accuracy hypothesis. Thus the averaging algorithm alone will not
suffice after outlier removal.

1. We note briefly that the sophisticated outlier removal techniques of Blum et al. (1997) and Dunagan and Vempala
(2004) do not seem to be useful in our setting; those works deal with a strong notion of outliers, which is such that
no point on the unit ball can be an outlier if a significant fraction of points are uniformly distributed on the unit ball.

To get around this, we show that after outlier removal the average of the positive examples gives 
a (real-valued) weak hypothesis that has some nontrivial predictive accuracy. (Interestingly, the 
proof of this relies heavily on anti-concentration properties of isotropic log-concave distributions!) 
A natural approach is then to use a boosting algorithm to convert this weak learner into a strong 
learner. This is not entirely straightforward because boosting “skews” the distribution of examples; 
this has the undesirable effects of both increasing the effective malicious noise rate, and causing 
the distribution to no longer be isotropic log-concave. However, by using a “smooth” boosting 
algorithm (Servedio, 2003) that skews the distribution as little as possible, we are able to control 
these undesirable effects and make the analysis go through. (The extra factor of $\epsilon$ in the bound of
Theorem 2 compared with Theorem 1 comes from the fact that the boosting algorithm constructs
"$(1/\epsilon)$-skewed" distributions.)

We note that our approach of using smooth boosting is reminiscent of earlier work (Servedio, 
2002, 2003), but the current algorithm goes well beyond that. Servedio (2002) did not consider a 
noisy scenario, and Servedio (2003) only considered the averaging algorithm without any outlier 
removal as the weak learner (and thus could only handle quite low rates of malicious noise in our 
isotropic log-concave setting). 


1.3.3 TOLERATING ADVERSARIAL LABEL NOISE 


Finally, our results for learning under isotropic log-concave distributions with adversarial label noise 
are obtained using a similar approach. The algorithm here is in fact simpler than the malicious 
noise algorithm: since the adversarial label noise model does not allow the adversary to alter the
distribution of the examples in $\mathbb{R}^n$, we can dispense with the outlier removal and simply use smooth
boosting with the averaging algorithm as the weak learner. (This is why we get a slightly better 
quantitative bound in Theorem 3 than Theorem 2). 


1.3.4 ORGANIZATION 


For completeness we review the precise definitions of isotropic log-concave distributions and the 
various learning models in Section 2. We present the simpler and more easily understood uniform 
distribution analysis in Section 3. We extend the algorithm and analysis to isotropic log-concave 
distributions in Section 4. Learning with adversarial label noise is treated in Section 5. We conclude 
in Section 6. 


2. Definitions and Preliminaries 


In this section, we provide some definitions and lemmas that will be used throughout the paper. 


2.1 Learning with Malicious Noise 


Given a probability distribution $D$ over $\mathbb{R}^n$, and a target function $f : \mathbb{R}^n \to \{-1, 1\}$, we define the
oracle $EX_\eta(f, D)$ as follows:

• with probability $1 - \eta$ the oracle draws $x$ according to $D$, and outputs $(x, f(x))$, and

• with probability $\eta$ the oracle outputs an arbitrary $(x, y)$ pair. This "noisy" example can be
thought of as being generated adversarially and can depend on the state of the learning
algorithm and previous draws from the oracle.
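The oracle just defined can be simulated in a few lines. The sketch below is only illustrative: the names `make_malicious_oracle`, `draw_clean` and `adversary` are hypothetical, and the adversary here sees only the history of previous draws, whereas the model also lets it depend on the learner's internal state.

```python
import random

def make_malicious_oracle(f, draw_clean, adversary, eta, rng):
    """Return a zero-argument function simulating EX_eta(f, D).

    f          -- target function mapping a point to -1 or +1
    draw_clean -- hypothetical sampler for D: rng -> point
    adversary  -- hypothetical callback: history -> arbitrary (x, y) pair
    eta        -- noise rate
    """
    history = []  # previous draws, which the adversary may inspect

    def oracle():
        if rng.random() < eta:
            example = adversary(history)   # "dirty" example
        else:
            x = draw_clean(rng)
            example = (x, f(x))            # "clean" example
        history.append(example)
        return example

    return oracle


rng = random.Random(0)
target = lambda x: 1 if x[0] >= 0 else -1                # halfspace with normal e_1
clean = lambda r: (r.uniform(-1, 1), r.uniform(-1, 1))   # stand-in for D
bad = lambda history: ((1.0, 0.0), -1)                   # always the same mislabeled point

oracle = make_malicious_oracle(target, clean, bad, eta=0.1, rng=rng)
sample = [oracle() for _ in range(2000)]
dirty_fraction = sum(1 for ex in sample if ex == ((1.0, 0.0), -1)) / len(sample)
```

With a fixed adversary like `bad`, the dirty examples are easy to spot; the difficulty of the model comes from adversaries that choose their points to look as much like clean data as possible.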


Given a data set drawn from $EX_\eta(f, D)$, we often refer to the examples $(x, f(x))$ (that came
from $D$) as "clean" examples and the remaining examples $(x, y)$ as "dirty" examples.

For a set $\mathcal{S}$ of probability distributions and a set $F$ of possible target functions, we say that
a learning algorithm $A$ learns $F$ to accuracy $1 - \epsilon$ with respect to $\mathcal{S}$ in the presence of malicious
noise at a rate $\eta$ if the following holds: for any $f \in F$ and $D \in \mathcal{S}$, given access to $EX_\eta(f, D)$,
with probability at least $1/2$, the output hypothesis $h$ generated by $A$ satisfies $\Pr_{x \sim D}[h(x) \ne f(x)] \le \epsilon$.
(The probability of success may be amplified arbitrarily close to 1 using standard techniques
(Haussler et al., 1991).)

Since scaling $x$ by a positive constant does not affect its classification by a linear classifier,
drawing examples uniformly from the unit ball is equivalent to drawing them uniformly from the
surface $S^{n-1}$ of the unit sphere. When this is the distribution, we may also assume w.l.o.g. that
even noisy examples $(x, y)$ have $x \in S^{n-1}$; this is simply because a learning algorithm can trivially
identify and ignore any noisy example $(x, y)$ that has $\|x\| \ne 1$.


2.2 Log-concave Distributions 


A probability distribution over $\mathbb{R}^n$ is said to be log-concave if its density function is $\exp(-\psi(x))$ for
a convex function $\psi$.

A probability distribution over $\mathbb{R}^n$ is isotropic if the mean of the distribution is 0 and the
covariance matrix is the identity, that is, $\mathbf{E}[x_i x_j] = 1$ for $i = j$ and 0 otherwise.

Isotropic log-concave (henceforth abbreviated i.l.c.) distributions are a fairly broad class of 
distributions. It is well known that any distribution induced by taking a uniform distribution over 
an arbitrary convex set and applying a suitable linear transformation to make it isotropic is then 
isotropic and log-concave. For an excellent treatment on basic properties of log-concave distribu- 
tions, see Lovász and Vempala (2007). 

We will use the following facts: 


Lemma 4 (Lovász and Vempala 2007) Let $D$ be an isotropic log-concave distribution over $\mathbb{R}^n$
and $a \in S^{n-1}$ any direction. Then for $x$ drawn according to $D$, the distribution of $a \cdot x$ is an isotropic
log-concave distribution over $\mathbb{R}$.

Lemma 5 (Lovász and Vempala 2007) Any isotropic log-concave distribution $D$ over $\mathbb{R}^n$ has light
tails,

$$\Pr_{x \sim D}\left[\|x\| > \beta\sqrt{n}\right] \le e^{-\beta+1}.$$

If $n = 1$, the density of $D$ is bounded:

$$\Pr_{x \sim D}\left[x \in [a, b]\right] \le |b - a|.$$
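As a quick sanity check of Lemma 5 in the case $n = 1$, the standard Gaussian (which has mean 0, variance 1 and a log-concave density) can be tested numerically. The helper names below are hypothetical and the chosen values of $\beta$ and of the intervals are arbitrary; this is a spot check, not a proof.

```python
import math

def gaussian_tail(beta):
    """Pr[|x| > beta] for a standard normal x (n = 1, so sqrt(n) = 1)."""
    return math.erfc(beta / math.sqrt(2.0))

def gaussian_interval_mass(a, b):
    """Pr[x in [a, b]] for a standard normal x."""
    cdf = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return cdf(b) - cdf(a)

betas = [0.5, 1.0, 2.0, 4.0, 8.0]
# Light tails: Pr[|x| > beta] <= exp(-beta + 1).
tail_ok = all(gaussian_tail(b) <= math.exp(-b + 1.0) for b in betas)

intervals = [(-0.1, 0.1), (0.0, 1.0), (-2.0, 3.0)]
# Bounded density: Pr[x in [a, b]] <= |b - a|.
density_ok = all(gaussian_interval_mass(a, b) <= abs(b - a) for a, b in intervals)
```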


3. The Uniform Distribution and Malicious Noise 


In this section we prove Theorem 1. As described above, our algorithm first does outlier removal
using PCA and then applies the "averaging algorithm."

We may assume throughout that the noise rate $\eta$ is smaller than some absolute constant, and
that the dimension $n$ is larger than some absolute constant.


3.1 The Algorithm: Removing Outliers and Averaging 
Consider the following Algorithm $A_{mu}$:

Algorithm $A_{mu}$:

1. Draw a sample $S$ of $m = \mathrm{poly}(n/\epsilon)$ many examples from the malicious oracle.

2. Identify the direction $w \in S^{n-1}$ that maximizes

$$\sigma_w^2 \stackrel{\mathrm{def}}{=} \sum_{(x,y) \in S} (w \cdot x)^2.$$

If $\sigma_w^2 < \frac{10 m \log m}{n}$ then go to Step 4, otherwise go to Step 3.

3. Remove from $S$ every example that has $(w \cdot x)^2 \ge \frac{10 \log m}{n}$. Go to Step 2.

4. For the examples $S$ that remain let $v = \frac{1}{|S|} \sum_{(x,y) \in S} y x$ and output the linear classifier $h_v$ defined
by $h_v(x) = \mathrm{sgn}(v \cdot x)$.
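Steps 2 to 4 can be sketched in pure Python as follows. The helper names (`top_direction`, `a_mu`, `sphere_point`, `error_rate`), the constant parameter `c`, and the use of power iteration in place of an exact eigen-solver are assumptions made for illustration. Setting `c = 10` reproduces the two thresholds above; the demo uses a smaller `c` only because its sample is tiny compared with the asymptotic regime of the analysis.

```python
import math
import random

def top_direction(points, iters=200, seed=0):
    """Power iteration on A = sum_x x x^T; returns (unit w, sum_x (w.x)^2)."""
    n = len(points[0])
    A = [[sum(x[i] * x[j] for x in points) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in aw)) or 1.0
        w = [c / norm for c in aw]
    sigma2 = sum(sum(xi * wi for xi, wi in zip(x, w)) ** 2 for x in points)
    return w, sigma2

def a_mu(sample, c=10.0):
    """Outlier removal followed by averaging (Steps 2-4)."""
    S = list(sample)
    m, n = len(S), len(S[0][0])
    var_limit = c * m * math.log(m) / n   # Step 2 threshold, in terms of the original m
    cut = c * math.log(m) / n             # Step 3 threshold
    while S:
        w, sigma2 = top_direction([x for x, _ in S])
        if sigma2 < var_limit:
            break
        # sigma2 >= var_limit forces at least one point over the cut, so S shrinks.
        S = [(x, y) for x, y in S
             if sum(xi * wi for xi, wi in zip(x, w)) ** 2 < cut]
    v = [sum(y * x[i] for x, y in S) / max(len(S), 1) for i in range(n)]
    return v, S

def sphere_point(n, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(c * c for c in g))
    return tuple(c / r for c in g)

def error_rate(v, u):
    """Angle between v and the unit vector u, divided by pi."""
    dot = sum(a * b for a, b in zip(v, u))
    perp = math.sqrt(max(sum(a * a for a in v) - dot * dot, 0.0))
    return math.atan2(perp, dot) / math.pi

rng = random.Random(1)
n, m_clean, m_dirty = 20, 340, 60
u = tuple(1.0 if i == 0 else 0.0 for i in range(n))      # target normal vector
sample = [(x, 1 if x[0] >= 0 else -1)
          for x in (sphere_point(n, rng) for _ in range(m_clean))]
sample += [(tuple(-c for c in u), 1)] * m_dirty          # coordinated dirty examples
v_plain = [sum(y * x[i] for x, y in sample) / len(sample) for i in range(n)]
v_mu, kept = a_mu(sample, c=0.5)                         # small c only for this demo
err_plain, err_mu = error_rate(v_plain, u), error_rate(v_mu, u)
```

In this demo the dirty examples all sit at $-u$ with label $+1$, so they pull the plain average directly against the target direction; comparing `err_plain` with `err_mu` shows the effect of removing them before averaging.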











We first observe that Step 2 can be carried out in polynomial time: 


Lemma 6 There is a polynomial-time algorithm that, given a finite collection $S$ of points in $\mathbb{R}^n$,
outputs $w \in S^{n-1}$ that maximizes $\sum_{x \in S} (w \cdot x)^2$.

Proof. By applying Lagrange multipliers, we can see that the optimal $w$ is an eigenvector of
$A = \sum_{x \in S} x x^T$. Further, if $\lambda$ is the eigenvalue of $w$, then $\sum_{x \in S} (w \cdot x)^2 = w^T A w = w^T(\lambda w) = \lambda$. The
eigenvector $w$ with the largest eigenvalue can be found in polynomial time (see, e.g., Jolliffe 2002). □
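One simple way to approximate the maximizing direction is power iteration, sketched below. This is a swapped-in illustrative technique, not the method the proof relies on: it converges to the top eigenvector only approximately and assumes a gap between the two largest eigenvalues. The function names are hypothetical.

```python
import math
import random

def top_direction(points, iters=200, seed=0):
    """Approximate the unit vector w maximizing sum_x (w . x)^2."""
    n = len(points[0])
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Apply A = sum_x x x^T without forming it: A w = sum_x (x . w) x.
        aw = [0.0] * n
        for x in points:
            dot = sum(xi * wi for xi, wi in zip(x, w))
            for i in range(n):
                aw[i] += dot * x[i]
        norm = math.sqrt(sum(c * c for c in aw))
        if norm == 0.0:
            break  # every point is orthogonal to w
        w = [c / norm for c in aw]
    return w

def directional_sum(points, w):
    """sum_x (w . x)^2, which equals the eigenvalue when w is a unit eigenvector."""
    return sum(sum(xi * wi for xi, wi in zip(x, w)) ** 2 for x in points)

pts = [(3.0, 0.0), (-2.0, 0.1), (0.0, 0.5), (2.5, -0.2)]
w = top_direction(pts)
lam = directional_sum(pts, w)
```

For this small planar example `lam` can be compared against the closed-form largest eigenvalue of the $2 \times 2$ matrix $A$, which is a convenient way to confirm the identity $\sum_{x \in S}(w \cdot x)^2 = \lambda$ from the proof.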


Before embarking on the analysis we establish a terminological convention. Much of our
analysis deals with high-probability statements over the draw of the $m$-element sample $S$; it is
straightforward but quite cumbersome to explicitly keep track of all of the failure probabilities. Thus we write
"with high probability" (or "w.h.p.") in various places below as a shorthand for "with probability at
least $1 - 1/\mathrm{poly}(n/\epsilon)$." The interested reader can easily verify that an appropriate $\mathrm{poly}(n/\epsilon)$ choice
of $m$ makes all the failure probabilities small enough so that the entire algorithm succeeds with
probability at least $1/2$ as required.


3.2 Properties of the Clean Examples 


In this subsection we establish properties of the clean examples that were sampled in Step 1 of $A_{mu}$.
The first says that no direction has much more variance than the expected variance of $1/n$:

Lemma 7 W.h.p. over a random draw of $\ell$ clean examples $S_{clean}$, we have

$$\max_{a \in S^{n-1}} \left\{ \frac{1}{\ell} \sum_{(x,y) \in S_{clean}} (a \cdot x)^2 \right\} \le \frac{1}{n} + \sqrt{\frac{O(n + \log \ell)}{\ell}}.$$

Proof. The proof uses standard tools from VC theory and is in Appendix A. □


The next lemma says that in fact no direction has too many clean examples lying far out in that
direction:

Lemma 8 For any $\beta > 0$ and $\kappa > 1$, if $S_{clean}$ is a random set of
$\ell \ge \frac{O(1) \cdot n^2 \beta^2 e^{\beta^2 n/2}}{(1+\kappa)\ln(1+\kappa)}$ clean examples
then w.h.p. we have

$$\max_{a \in S^{n-1}} \frac{1}{\ell} \left| \{ x \in S_{clean} : (a \cdot x)^2 > \beta^2 \} \right| \le (1+\kappa)\, e^{-\beta^2 n/2}.$$

Proof. In Appendix B. □


3.3 What is Removed 
In this section, we provide bounds on the number of clean and dirty examples removed in Step 3. 


The first bound is a Corollary of Lemma 8. 


Corollary 9 W.h.p. over the random draw of the $m$-element sample $S$, the number of clean examples
removed during any one execution of Step 3 in $A_{mu}$ is at most $6n \log m$.

Proof. Since the noise rate $\eta$ is sufficiently small, w.h.p. the number $\ell$ of clean examples is at least
(say) $m/2$. We would like to apply Lemma 8 with $\kappa = 5\ell^4 n \log \ell$ and $\beta = \sqrt{10\log(m)/n}$, and indeed we
may do this because we have

$$\frac{O(1) \cdot n^2 \beta^2 e^{\beta^2 n/2}}{(1+\kappa)\ln(1+\kappa)} \le \frac{O(1) \cdot n (\log m)\, m^5}{(1+\kappa)\ln(1+\kappa)} \le \frac{O(1) \cdot m}{\log m} \le \frac{m}{2} \le \ell$$

for $n$ sufficiently large. Since clean points are only removed if they have $(a \cdot x)^2 \ge \beta^2$, Lemma 8
gives us that the number of clean points removed is at most

$$m(1+\kappa)\, e^{-\beta^2 n/2} \le 6 m^5 n \log(\ell)/m^5 \le 6 n \log m.$$
□


The counterpart to Corollary 9 is the following lemma. It tells us that if examples are removed in 
Step 3, then there must be many dirty examples removed. It exploits the fact that Lemma 7 bounds 
the variance in all directions a, so that it can be reused to reason about what happens in different 
executions of step 3. 


Lemma 10 W.h.p. over the random draw of $S$, whenever $A_{mu}$ executes step 3, it removes at least
$\frac{4m \log m}{n}$ noisy examples from $S_{dirty}$, the set of dirty examples in $S$.


Proof. As stated earlier we may assume that $\eta \le 1/4$. This implies that w.h.p. the fraction $\hat{\eta}$ of
noisy examples in the initial set $S$ is at most $1/2$. Finally, Lemma 7 implies that $m = \Omega(n^3)$ suffices
for it to be the case that w.h.p., for all $a \in S^{n-1}$, for the original multiset $S_{clean}$ of clean examples
drawn in step 1, we have

$$\sum_{(x,y) \in S_{clean}} (a \cdot x)^2 \le \frac{2m}{n}. \qquad (1)$$

We shall say that a random sample $S$ that satisfies all these requirements is "reasonable". We will
show that for any reasonable data set, the number of noisy examples removed during the execution
of step 3 of $A_{mu}$ is at least $\frac{4m \log m}{n}$.

If we remove examples using direction $w$ then it means $\sum_{(x,y) \in S} (w \cdot x)^2 \ge \frac{10 m \log m}{n}$. Since $S$ is
reasonable, by (1) the contribution to the sum from the clean examples that survived to the current
stage is at most $2m/n$, so we must have

$$\sum_{(x,y) \in S_{dirty}} (w \cdot x)^2 \ge 10 m \log(m)/n - 2m/n > 9 m \log(m)/n.$$

Let us decompose $S_{dirty}$ into $N \cup F$ where $N$ ("near") consists of those points $x$ s.t. $(w \cdot x)^2 < 10\log(m)/n$
and $F$ ("far") is the remaining points for which $(w \cdot x)^2 \ge 10\log(m)/n$. Since
$|N| \le |S_{dirty}| \le \hat{\eta} m$ (any dirty examples removed in earlier rounds will only reduce the size of $S_{dirty}$), we
have

$$\sum_{(x,y) \in N} (w \cdot x)^2 \le (\hat{\eta} m)\, 10 \log(m)/n$$

and so

$$|F| \ge \sum_{(x,y) \in F} (w \cdot x)^2 \ge 9 m \log(m)/n - (\hat{\eta} m)\, 10 \log(m)/n \ge 4 m \log(m)/n$$

(the last line used the fact that $\hat{\eta} \le 1/2$). Since the points in $F$ are removed in Step 3, the lemma is
proved. □


3.4 Exploiting Limited Variance in Any Direction 


In this section, we show that if all directional variances are small, then the algorithm's final
hypothesis will have high accuracy.

We first recall a simple lemma which shows that a sample of “clean” examples results in a 
high-accuracy hypothesis for the averaging algorithm: 


Lemma 11 (Servedio 2001) Suppose $x_1, \ldots, x_m$ are chosen uniformly at random from $S^{n-1}$, and
a target weight vector $u \in S^{n-1}$ produces labels $y_1 = \mathrm{sign}(u \cdot x_1), \ldots, y_m = \mathrm{sign}(u \cdot x_m)$. Let
$v = \frac{1}{m} \sum_{t=1}^{m} y_t x_t$. Then w.h.p. $u \cdot v = \Omega(1/\sqrt{n})$, while $\|v - (u \cdot v)u\| = O(\sqrt{\log(n)/m})$.
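The two quantities in Lemma 11 are easy to observe empirically. The following sketch, with hypothetical names and arbitrarily chosen `n`, `m` and seed, draws clean uniform examples, runs the averaging step, and splits $v$ into its component along $u$ and the orthogonal remainder.

```python
import math
import random

def averaging(sample):
    """v = (1/m) * sum_t y_t x_t for a list of (x, y) pairs."""
    m, n = len(sample), len(sample[0][0])
    return [sum(y * x[i] for x, y in sample) / m for i in range(n)]

rng = random.Random(7)
n, m = 10, 2000
u = [1.0] + [0.0] * (n - 1)                     # target normal vector (unit length)
sample = []
for _ in range(m):
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(c * c for c in g))
    x = [c / r for c in g]                       # uniform on the unit sphere
    sample.append((x, 1 if x[0] >= 0 else -1))   # y = sign(u . x)

v = averaging(sample)
along = sum(a * b for a, b in zip(u, v))         # u . v
perp = math.sqrt(sum(c * c for c in v[1:]))      # ||v - (u . v) u||, since u = e_1
```

Here `along` stays on the order of $1/\sqrt{n}$ however large $m$ is, while `perp` shrinks as $m$ grows, which is why the angle between $v$ and $u$ becomes small for clean data.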

Now we can state Lemma 12. 


Lemma 12 Let $S = S_{clean} \cup S_{dirty}$ be the sample of $m$ examples drawn from the noisy oracle $EX_\eta(f, \mathcal{U})$.
Let

• $S'_{clean}$ be those clean examples that were never removed during step 3 of $A_{mu}$,

• $S'_{dirty}$ be those dirty examples that were never removed during step 3 of $A_{mu}$,

• $\eta' = \frac{|S'_{dirty}|}{|S'_{clean} \cup S'_{dirty}|}$, that is, the fraction of dirty examples among the examples that survive step 3,
and

• $\alpha = \frac{|S_{clean} - S'_{clean}|}{|S'_{clean} \cup S'_{dirty}|}$, the ratio of the number of clean points that were erroneously removed to the
size of the final surviving data set.

Let $S' = S'_{clean} \cup S'_{dirty}$. Suppose that $|S'| \ge m/2$ (i.e., fewer than half the total points were removed)
and that, for every direction $w \in S^{n-1}$ we have

$$\sigma_w^2 \stackrel{\mathrm{def}}{=} \sum_{(x,y) \in S'} (w \cdot x)^2 \le \frac{10 m \log m}{n}.$$

Then w.h.p. over the draw of $S$, the halfspace with normal vector $v \stackrel{\mathrm{def}}{=} \frac{1}{|S'|} \sum_{(x,y) \in S'} y x$ has error rate

$$O\left( \sqrt{\frac{n \log n}{m}} + \sqrt{\eta' \log m} + \alpha\sqrt{n} \right).$$
Proof. The claimed bound is trivial unless $\eta' \le o(1)/\log m$ and $\alpha \le o(1)/\sqrt{n}$, so we shall freely
use these bounds in what follows.

Let $u$ be the unit length normal vector for the target halfspace. Let $v_{clean}$ be the average of all
the clean examples, $v'_{dirty}$ be the average of the dirty (noisy) examples that were not deleted (i.e., the
examples in $S'_{dirty}$), and $v_{del}$ be the average of the clean examples that were deleted. Then

$$v = \frac{1}{|S'_{clean} \cup S'_{dirty}|} \sum_{(x,y) \in S'_{clean} \cup S'_{dirty}} y x$$

$$= \frac{1}{|S'_{clean} \cup S'_{dirty}|} \left( \sum_{(x,y) \in S_{clean}} y x + \sum_{(x,y) \in S'_{dirty}} y x - \sum_{(x,y) \in S_{clean} - S'_{clean}} y x \right),$$

so that

$$v = (1 - \eta' + \alpha)\, v_{clean} + \eta' v'_{dirty} - \alpha v_{del}.$$

Let us begin by exploiting the bound on the variance in every direction to bound the length of
$v'_{dirty}$. For any $w \in S^{n-1}$ we know that

$$\sum_{(x,y) \in S'} (w \cdot x)^2 \le \frac{10 m \log m}{n} \quad \text{and hence} \quad \sum_{(x,y) \in S'_{dirty}} (w \cdot x)^2 \le \frac{10 m \log m}{n}$$

since $S'_{dirty} \subseteq S'$. Since $|S'_{dirty}| \le \eta' m$, the fact that $\|r\|_1 \le \sqrt{k}\, \|r\|_2$ for any vector $r \in \mathbb{R}^k$ gives

$$\sum_{(x,y) \in S'_{dirty}} |w \cdot x| \le \sqrt{\frac{10 m |S'_{dirty}| \log m}{n}}.$$

Taking $w$ to be the unit vector in the direction of $v'_{dirty}$, we have

$$\|v'_{dirty}\| = w \cdot v'_{dirty} = \frac{1}{|S'_{dirty}|} \sum_{(x,y) \in S'_{dirty}} y\,(w \cdot x) \le \frac{1}{|S'_{dirty}|} \sum_{(x,y) \in S'_{dirty}} |w \cdot x| \le \sqrt{\frac{10 m \log m}{|S'_{dirty}|\, n}}. \qquad (2)$$


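Because the domain distribution is uniform, the error of $h_v$ is proportional to the angle between
$v$ and $u$, in particular,

$$\Pr[h_v \ne f] = \frac{1}{\pi} \arctan\left( \frac{\|v - (v \cdot u)u\|}{u \cdot v} \right) \le \frac{1}{\pi} \cdot \frac{\|v - (v \cdot u)u\|}{u \cdot v}. \qquad (3)$$

We have that $\|v - (v \cdot u)u\|$ equals

$$\left\| (1 - \eta' + \alpha)(v_{clean} - (v_{clean} \cdot u)u) + \eta' (v'_{dirty} - (v'_{dirty} \cdot u)u) - \alpha (v_{del} - (v_{del} \cdot u)u) \right\|$$

$$\le 2\|v_{clean} - (v_{clean} \cdot u)u\| + \eta' \|v'_{dirty}\| + \alpha \|v_{del}\|$$

where we have used the triangle inequality and the fact that $\alpha$, $\eta'$ are "small." Lemma 11 lets us
bound the first term in the sum by $O(\sqrt{\log(n)/m})$, and the fact that $v_{del}$ is an average of vectors of
length 1 lets us bound the third by $\alpha$. For the second term, Equation (2) gives us

$$\eta' \|v'_{dirty}\| \le \sqrt{\frac{10 m (\eta')^2 \log m}{|S'_{dirty}|\, n}} = \sqrt{\frac{10 m\, \eta' \log m}{|S'|\, n}} \le \sqrt{\frac{20 \eta' \log m}{n}},$$

where the equality uses $|S'_{dirty}| = \eta' |S'|$ and the last inequality uses $|S'| \ge m/2$. We thus get

$$\|v - (v \cdot u)u\| \le O\left(\sqrt{\log(n)/m}\right) + \sqrt{20 \eta' \log(m)/n} + \alpha. \qquad (4)$$

Now we consider the denominator of (3). We have

$$u \cdot v = (1 - \eta' + \alpha)(u \cdot v_{clean}) + \eta'\, u \cdot v'_{dirty} - \alpha\, u \cdot v_{del}.$$

Similar to the above analysis, we again use Lemma 11 (but now the lower bound $u \cdot v \ge \Omega(1/\sqrt{n})$),
Equation (2), and the fact that $\|v_{del}\| \le 1$. Since $\alpha$ and $\eta'$ are "small," we get that there is an absolute
constant $c$ such that $u \cdot v \ge c/\sqrt{n} - \sqrt{20 \eta' \log(m)/n} - \alpha$. Combining this with (4) and (3), we get

$$\Pr[h_v \ne f] \le O\left( \frac{\sqrt{\frac{\log n}{m}} + \sqrt{\frac{20 \eta' \log m}{n}} + \alpha}{\frac{c}{\sqrt{n}} - \sqrt{\frac{20 \eta' \log m}{n}} - \alpha} \right) = O\left( \sqrt{\frac{n \log n}{m}} + \sqrt{\eta' \log m} + \alpha \sqrt{n} \right).$$
□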










3.5 Proof of Theorem 1 


By Corollary 9, w.h.p. each outlier removal stage removes at most $6n \log m$ clean points.

Since, by Lemma 10, each outlier removal stage removes at least $\frac{4m \log m}{n}$ noisy examples, there
must be at most $O(n/\log m)$ such stages. Consequently the total number of clean examples
removed across all stages is $O(n^2)$. Since w.h.p. the initial number of clean examples is at least
$3m/4$, this means that the final data set (on which the averaging algorithm is run) contains at least
$3m/4 - O(n^2)$ clean examples, and hence at least $3m/4 - O(n^2)$ examples in total. The condition
$m \gg n^2$ means that the number of surviving examples will be at least $m/2$. Consequently the value
of $\alpha$ from Lemma 12 after the final outlier removal stage (the ratio of the total number of clean
examples deleted, to the total number of surviving examples) is at most $O(n^2/m)$.

The standard Hoeffding bound implies that w.h.p. the actual fraction of noisy examples in the
original sample $S$ is at most $\eta + \sqrt{O(\log m)/m}$. It is easy to see that w.h.p. the fraction of dirty
examples does not increase (since each stage of outlier removal removes more dirty points than
clean points, for a suitably large $\mathrm{poly}(n/\epsilon)$ value of $m$), and thus the fraction $\eta'$ of dirty examples
among the remaining examples after the final outlier removal stage is at most $\eta + \sqrt{O(\log m)/m}$.

Applying Lemma 12, for a suitably large value $m = \mathrm{poly}(n/\epsilon)$, we obtain
$\Pr[h_v \ne f] \le O(\sqrt{\eta \log m})$. Rearranging this bound, we can learn to accuracy $\epsilon$ even for $\eta = \Omega(\epsilon^2/\log(n/\epsilon))$.
This completes the proof of the theorem. □
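The final rearrangement can be spelled out as follows. Here $C$ stands for the unspecified constant hidden in the $O(\cdot)$ and $k$ for an exponent with $m \le (n/\epsilon)^k$; both symbols are introduced only for this sketch.

```latex
\[
C\sqrt{\eta \log m} \le \epsilon
\quad\Longleftrightarrow\quad
\eta \le \frac{\epsilon^2}{C^2 \log m},
\qquad\text{and}\qquad
\log m \le k \log(n/\epsilon)
\;\Longrightarrow\;
\frac{\epsilon^2}{C^2 k \log(n/\epsilon)} \le \frac{\epsilon^2}{C^2 \log m}.
\]
```

So any noise rate $\eta \le \epsilon^2 / (C^2 k \log(n/\epsilon))$ yields error at most $\epsilon$, which is the $\Omega(\epsilon^2/\log(n/\epsilon))$ condition of Theorem 1.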
4. Isotropic Log-concave Distributions and Malicious Noise

Our algorithm $A_{mlc}$ that works for arbitrary isotropic log-concave distributions uses smooth
boosting.


4.1 Smooth Boosting 


A boosting algorithm uses a subroutine, called a weak learner, that is only guaranteed to output
hypotheses with a non-negligible advantage over random guessing.² The boosting algorithm that we
consider uses a confidence-rated weak learner (Schapire and Singer, 1999), which predicts $\{-1, 1\}$
labels using continuous values in $[-1, 1]$. Formally, the advantage of a hypothesis $h'$ with respect to
a distribution $D'$ is defined to be $\mathbf{E}_{x \sim D'}[h'(x) f(x)]$, where $f$ is the target function.
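The advantage of a confidence-rated hypothesis can be estimated on a finite set of points by averaging $h'(x) f(x)$. The sketch below uses hypothetical names and a toy one-dimensional target chosen only for illustration.

```python
def empirical_advantage(h, f, points):
    """Average of h(x) * f(x) over points assumed to be drawn from D'."""
    return sum(h(x) * f(x) for x in points) / len(points)

f = lambda x: 1 if x >= 0 else -1          # target on the real line
h = lambda x: max(-1.0, min(1.0, x))       # confidence-rated guess in [-1, 1]
guess_nothing = lambda x: 0.0              # abstains everywhere

points = [-2.0, -0.5, 0.25, 1.0]
adv = empirical_advantage(h, f, points)
adv_zero = empirical_advantage(guess_nothing, f, points)
```

A hypothesis that always outputs 0 has advantage 0, and one that always outputs the correct label has advantage 1, so the advantage interpolates between random guessing and perfect prediction.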


For the purposes of this paper, a boosting algorithm makes use of the weak learner, an example
oracle (possibly corrupted with noise), a desired accuracy $\epsilon$, and a bound $\gamma$ on the advantage of the
hypothesis output by the weak learner.


A boosting algorithm that is trying to learn an unknown target function $f$ with respect to some
distribution $D$ repeatedly simulates a (possibly noisy) example oracle for $f$ with respect to some
other distribution $D'$ and calls a subroutine $A_{weak}$ with respect to this oracle, receiving a weak
hypothesis, which maps $\mathbb{R}^n$ to the continuous interval $[-1, 1]$.


After repeating this for some number of stages, the boosting algorithm combines the weak 
hypotheses generated during its various calls to the weak learner into a final aggregate hypothesis 
which it outputs. 


Let $D$, $D'$ be two distributions over $\mathbb{R}^n$. We say that $D'$ is $(1/\epsilon)$-smooth with respect to $D$ if
$D'(E) \le (1/\epsilon) D(E)$ for all events $E$.
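For distributions on a finite set, the smoothness condition over all events is equivalent to the same inequality on every single outcome, which makes it easy to check. The sketch below uses a hypothetical function name and toy distributions, with a small tolerance added for floating-point comparisons.

```python
def is_smooth(d_prime, d, eps):
    """Check D'(x) <= D(x) / eps for every outcome x of a finite distribution."""
    return all(p <= d.get(x, 0.0) / eps + 1e-12 for x, p in d_prime.items())

d = {"a": 0.5, "b": 0.3, "c": 0.2}
mild = {"a": 0.2, "b": 0.4, "c": 0.4}    # no outcome gains more than a factor of 2
harsh = {"a": 0.0, "b": 0.0, "c": 1.0}   # all weight moved onto "c" (a factor of 5)

mild_ok = is_smooth(mild, d, eps=0.5)
harsh_ok = is_smooth(harsh, d, eps=0.5)
harsh_ok_loose = is_smooth(harsh, d, eps=0.2)
```

The `harsh` reweighting is only $(1/\epsilon)$-smooth for $\epsilon \le 0.2$; this is the kind of skew a smooth booster avoids, since it would also multiply the effective noise rate by up to $1/\epsilon$.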

The following lemma from Servedio (2003) (similar results can be readily found elsewhere, 
see, e.g., Gavinsky 2003) identifies the properties that we need from a boosting algorithm for our 
analysis. 


Lemma 13 (Servedio 2003) There is a boosting algorithm $B$ and a polynomial $p$ such that, for
any $\epsilon, \gamma > 0$, the following properties hold. When learning a target function $f$ using $EX_\eta(f, D)$,
we have: (a) If each call to $A_{weak}$ takes time $t$, then $B$ takes time $p(t, 1/\gamma, 1/\epsilon)$. (b) The weak
learner is always called with an oracle $EX_{\eta'}(f, D')$ where $D'$ is $(1/\epsilon)$-smooth with respect to $D$
and $\eta' \le \eta/\epsilon$. (c) Suppose that for each distribution $EX_{\eta'}(f, D')$ passed to $A_{weak}$ by $B$, the output
of $A_{weak}$ has advantage $\gamma$. Then the final output $h$ of $B$ satisfies $\Pr_{x \sim D}[h(x) \ne f(x)] \le \epsilon$.


4.2 The Algorithm 


Our algorithm for learning under isotropic log-concave distributions with malicious noise, Algorithm
$A_{\mathrm{mlc}}$, applies the smooth booster from Lemma 13 with the following weak learner, which we
call Algorithm $A_{\mathrm{mlcw}}$. (The value $c_0$ is an absolute constant that will emerge from our analysis.)





2. For simplicity of presentation we ignore the confidence parameter of the weak learner in our discussion; this can be 
handled in an entirely standard way. 
Algorithm $A_{\mathrm{mlcw}}$:

1. Draw $m = \mathrm{poly}(n/\varepsilon)$ examples from the oracle $EX_{\eta'}(f, D')$.

2. Remove all those examples $(x, y)$ for which $\|x\| > \sqrt{3n\log m}$.

3. Repeatedly
   • find a direction (unit vector) $w$ that maximizes $\sum_{(x,y)\in S}(w\cdot x)^2$ (see Lemma 6)
   • if $\sum_{(x,y)\in S}(w\cdot x)^2 \le c_0^2\, m\log^2(n/\varepsilon)$ then move on to Step 4, and otherwise
   • remove from $S$ all examples $(x, y)$ for which $|w\cdot x| > c_0\log(n/\varepsilon)$, and iterate again.

4. Let $v = \frac{1}{|S|}\sum_{(x,y)\in S} y\,x$, and return $h$ defined by $h(x) = \frac{v\cdot x}{3n\log m}$ if $|v\cdot x| \le 3n\log m$, and $h(x) = \mathrm{sgn}(v\cdot x)$ otherwise.
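The steps of the weak learner can be sketched in code as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the maximizing direction is approximated by power iteration (a stand-in for the procedure of Lemma 6), the constant `c0` is passed in as a parameter because its value only emerges from the analysis, and the function names and the toy data are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def top_direction(points, iters=100, seed=0):
    """Approximate the unit vector w maximizing sum((w . x)^2) by power
    iteration on sum_x x x^T (an assumed stand-in for Lemma 6)."""
    n = len(points[0])
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        nxt = [0.0] * n
        for x in points:
            c = dot(w, x)
            for i in range(n):
                nxt[i] += c * x[i]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        w = [t / norm for t in nxt]
    norm = math.sqrt(dot(w, w))
    return [t / norm for t in w]

def weak_learner_mlcw(sample, eps, c0):
    """sample: list of (x, y) with x a list of floats and y in {-1, +1}."""
    n, m = len(sample[0][0]), len(sample)
    bound = 3 * n * math.log(m)  # the quantity 3 n log m
    # Step 2: drop examples with ||x|| > sqrt(3 n log m).
    S = [(x, y) for x, y in sample if dot(x, x) <= bound]
    # Step 3: iterative outlier removal along the direction of largest spread.
    cap = c0 ** 2 * m * math.log(n / eps) ** 2
    cut = c0 * math.log(n / eps)
    while S:
        w = top_direction([x for x, _ in S])
        if sum(dot(w, x) ** 2 for x, _ in S) <= cap:
            break
        S = [(x, y) for x, y in S if abs(dot(w, x)) <= cut]
    if not S:
        raise ValueError("outlier removal discarded every example")
    # Step 4: average the signed survivors and scale the output into [-1, 1].
    v = [sum(y * x[i] for x, y in S) / len(S) for i in range(n)]

    def h(x):
        t = dot(v, x)
        return t / bound if abs(t) <= bound else math.copysign(1.0, t)

    return h, v

# Toy data (hypothetical): Gaussian clean points labeled by sign(x_1), plus
# 30 identical noisy points placed far out along the second coordinate.
rng = random.Random(1)
clean = []
for _ in range(300):
    x = [rng.gauss(0, 1) for _ in range(3)]
    clean.append((x, 1 if x[0] >= 0 else -1))
noisy = [([0.0, 6.0, 0.0], 1)] * 30
h, v = weak_learner_mlcw(clean + noisy, eps=0.5, c0=1.0)
print(v)
```

In this toy run the noisy points inflate the spread along the second coordinate, so the outlier removal step discards them before averaging, and the returned vector `v` points mostly along the target normal.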











4.3 The Key Claim: The Weak Learner is Effective 


Our main task is to analyze the weak learner. Given the following Lemma, Theorem 2 will be an 
immediate consequence of Lemma 13. 


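Lemma 14 Suppose Algorithm $A_{\mathrm{mlcw}}$ is run using $EX_{\eta'}(f, D')$ where $f$ is an origin-centered halfspace,
$D'$ is $(1/\varepsilon)$-smooth w.r.t. an isotropic log-concave distribution $D$, $\eta' \le \eta/\varepsilon$, and
$\eta \le \Omega(\varepsilon^3/\log^2(n/\varepsilon))$. Then w.h.p. the hypothesis $h$ returned by $A_{\mathrm{mlcw}}$ has advantage $\Omega\!\left(\frac{\varepsilon^2}{n\log m}\right)$.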


Before proving Lemma 14, we need to prove some uniformity results on non-noisy examples 
drawn from an isotropic, log-concave distribution. This will enable us to use outlier removal and 
averaging to find a weak learner. 


4.4 Lemmas in Support of Lemma 14 


In this section, let us consider a single call to the weak learner with an oracle $EX_{\eta'}(f, D')$ where $D'$
is $(1/\varepsilon)$-smooth with respect to an isotropic log-concave distribution $D$ and $\eta' \le \eta/\varepsilon$. Our analysis
will follow the same basic steps as Section 3.

A preliminary observation is that w.h.p. all clean examples drawn in Step 1 of Algorithm $A_{\mathrm{mlcw}}$
have $\|x\| \le \sqrt{3n\log m}$; indeed, for any given draw of $x$ from $D'$, the probability that $\|x\| > \sqrt{3n\log m}$
is small, by Lemma 5 together with the fact that $D'$ is $(1/\varepsilon)$-smooth with respect to an i.l.c.
distribution. Therefore, w.h.p., only noisy examples are removed in Step 2 of the algorithm, and we
shall assume that the distributions $D$ and $D'$ are in fact supported entirely on $\{x : \|x\| \le \sqrt{3n\log m}\}$.
This assumption affects us in two ways. First, it adds a small term to the failure probability
analysis below (which is not a problem and is in fact swallowed up by our “w.h.p.” notation).
Second, it means that the overall $1 - \varepsilon$ accuracy bound we establish for the entire learning algorithm
may be slightly worse than the true value. This is because our final hypothesis may always be
wrong on the examples $x$ that have $\|x\| > \sqrt{3n\log m}$ and are ignored in our analysis; however, such
examples have only small probability mass under the isotropic log-concave distribution $D$ (again
by Lemma 5), and $\varepsilon$ is much larger than this mass, so the additional accuracy cost does not affect
the overall correctness of our analysis. Note that a consequence of this assumption is that we can
just take $h(x) = \frac{v\cdot x}{3n\log m}$.

The remarks about high-probability statements and failure probabilities from Section 3.1 apply
here as well, and as in Section 3 we write “w.h.p.” as shorthand for “with probability
$1 - 1/\mathrm{poly}(n/\varepsilon)$.”
We first show that the variance of $D'$ in every direction is not too large:

Lemma 15 For any $a \in S^{n-1}$ we have $\mathbf{E}_{x\sim D'}[(a\cdot x)^2] = O(\log^2(1/\varepsilon))$.


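Proof. For $x$ chosen according to $D$, the distribution of $a\cdot x$ is a unit variance log-concave distribution
by Lemma 4. Thus, for any positive integer $k$,

\[
\begin{aligned}
\mathbf{E}_{x\sim D'}[(a\cdot x)^2]
 &\le k^2 + \sum_{i=k}^{\infty}(i+1)^2 \Pr_{x\sim D'}\big[|a\cdot x| \in (i, i+1]\big] \\
 &\le k^2 + \sum_{i=k}^{\infty}(i+1)^2\,(1/\varepsilon)\Pr_{x\sim D}\big[|a\cdot x| \in (i, i+1]\big] \\
 &\le k^2 + (1/\varepsilon)\sum_{i=k}^{\infty}(i+1)^2 \Pr_{x\sim D}\big[|a\cdot x| > i\big] \\
 &\le k^2 + (1/\varepsilon)\sum_{i=k}^{\infty}(i+1)^2 e^{-i+1} \;\le\; k^2 + (1/\varepsilon)\cdot\Theta(k^2 e^{-k}),
\end{aligned}
\]

where the first inequality in the last line uses Lemmas 4 and 5.
Setting $k = \ln(1/\varepsilon)$ completes the proof. □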


The following anticoncentration bound will be useful for proving that clean examples drawn 
from D’ tend to be classified correctly with a large margin. 


Lemma 16 Let $u \in S^{n-1}$. Then
\[ \mathbf{E}_{x\sim D'}[|u\cdot x|] \ge \varepsilon/8. \]


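Proof. Clearly

\[ \mathbf{E}_{x\sim D'}[|u\cdot x|] \ge (\varepsilon/4)\Pr_{x\sim D'}\big[|u\cdot x| > \varepsilon/4\big]. \]

But by Lemma 5,

\[ \Pr_{x\sim D'}\big[|u\cdot x| \le \varepsilon/4\big] \le \frac{1}{\varepsilon}\Pr_{x\sim D}\big[|u\cdot x| \le \varepsilon/4\big] \le \frac{\varepsilon/2}{\varepsilon} = 1/2. \]

□
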
The next two lemmas are isotropic log-concave analogues of the uniform distribution Lemmas 7 
and 8 respectively. The first one says that w.h.p. no direction a has much more variance than the 
expected variance in any direction: 
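Lemma 17 W.h.p. over a random draw of $\ell$ clean examples $S_{\mathrm{clean}}$ from $D'$, we have
\[ \max_{a\in S^{n-1}}\left\{\frac{1}{\ell}\sum_{(x,y)\in S_{\mathrm{clean}}}(a\cdot x)^2\right\} \le O(1)\left(\log^2(1/\varepsilon) + \frac{n^{3/2}\log^2 \ell}{\sqrt{\ell}}\right). \]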
Proof. By Lemma 15, for any $a \in S^{n-1}$ we have

\[ \mathbf{E}_{x\sim D'}[(a\cdot x)^2] = O(\log^2(1/\varepsilon)). \]

Since as remarked earlier we may assume $D'$ is supported on $\{x : \|x\| \le \sqrt{3n\log m}\}$, we may apply
Lemmas 25 and 27 (see Appendix A) with functions $f_a$ defined by $f_a(x) = \frac{(a\cdot x)^2}{3n\log m}$. This completes the
proof. □


The second lemma says that for a sufficiently large clean data set, w.h.p. no direction has too 
many examples lying too far out in that direction: 


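Lemma 18 For any $\beta > 0$ and $\kappa > 1$, if $S_{\mathrm{clean}}$ is a set of
\[ \ell \ge \frac{O(1)\,\varepsilon e^{\beta}\big(n\ln(\varepsilon e^{\beta}) + \log m\big)}{(1+\kappa)\ln(1+\kappa)} \]
clean examples drawn from $D'$, then w.h.p. we have
\[ \max_{a\in S^{n-1}}\frac{1}{\ell}\,\big|\{(x,y)\in S_{\mathrm{clean}} : |a\cdot x| > \beta\}\big| \le (1+\kappa)\,\frac{1}{\varepsilon}\,e^{-\beta+1}. \]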


Proof. Lemma 5 implies that for the original isotropic log-concave distribution $D$, we have

\[ \Pr_{x\sim D}\big[|a\cdot x| > \beta\big] \le e^{-\beta+1}. \]

Since $D'$ is $(1/\varepsilon)$-smooth with respect to $D$, this implies that

\[ \Pr_{x\sim D'}\big[|a\cdot x| > \beta\big] \le \frac{e^{-\beta+1}}{\varepsilon}. \tag{5} \]

In the proof of Lemma 8, we observed that the VC-dimension of
\[ \big\{\{x : |a\cdot x| > \beta\} : a\in\mathbb{R}^n,\ \beta\in\mathbb{R}\big\} \]
is $O(n)$, so applying Lemma 28 with (5) completes the proof of this lemma. □


The following is an isotropic log-concave analogue of Corollary 9, establishing that not too 
many clean examples are removed in the outlier removal step: 


Corollary 19 W.h.p. over the random draw of the $m$-element sample $S$ from $EX_{\eta'}(f, D')$, the number
of clean examples removed during any one execution of the outlier removal step (final substep
of Step 3) in Algorithm $A_{\mathrm{mlcw}}$ is at most $6m\varepsilon^3/n^4$.


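Proof. Since the true noise rate $\eta$ is assumed sufficiently small, the value $\eta' \le \eta/\varepsilon$ is at most $1/4$,
and thus w.h.p. the number $\ell$ of clean examples in $S$ is at least (say) $m/2$. We would like to apply
Lemma 18 with $\kappa = (n/\varepsilon)^{c_0-4}$ and $\beta = c_0\log(n/\varepsilon)$, and we may do this since we have

\[
\frac{O(1)\,\varepsilon e^{\beta}\big(n\ln(\varepsilon e^{\beta}) + \log m\big)}{(1+\kappa)\ln(1+\kappa)}
\;\le\; \frac{O(1)\,\varepsilon\,(n/\varepsilon)^{c_0}\, n\log m}{(n/\varepsilon)^{c_0-4}}
\;\le\; O(1)\,\frac{n^5\log m}{\varepsilon^3}
\;\le\; \frac{m}{2} \;\le\; \ell
\]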

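for a suitable fixed $\mathrm{poly}(n/\varepsilon)$ choice of $m$. Since clean points are only removed if they have $|a\cdot x| > \beta$,
Lemma 18 gives us that the number of clean points removed is at most

\[ m(1+\kappa)\cdot\frac{1}{\varepsilon}\,e^{-\beta+1} \;\le\; m\cdot\frac{(6/\varepsilon)\,(n/\varepsilon)^{c_0-4}}{(n/\varepsilon)^{c_0}} \;\le\; 6m\varepsilon^3/n^4. \]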
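The arithmetic behind this last bound can be spelled out as follows. This is a sketch that takes $\kappa = (n/\varepsilon)^{c_0-4}$ and $\beta = c_0\log(n/\varepsilon)$ from the proof, and additionally assumes that $\log$ denotes the natural logarithm (so that $e^{-\beta} = (n/\varepsilon)^{-c_0}$) and that $\kappa \ge 1$ (so that $1+\kappa \le 2\kappa$); it also uses $2e \le 6$.

```latex
\begin{align*}
m(1+\kappa)\cdot\frac{1}{\varepsilon}\,e^{-\beta+1}
  &\le m\cdot 2\kappa\cdot\frac{e}{\varepsilon}\,e^{-\beta}
   \le m\cdot\frac{6}{\varepsilon}\cdot\frac{(n/\varepsilon)^{c_0-4}}{(n/\varepsilon)^{c_0}} \\
  &= m\cdot\frac{6}{\varepsilon}\cdot\left(\frac{\varepsilon}{n}\right)^{4}
   = \frac{6m\varepsilon^{3}}{n^{4}}.
\end{align*}
```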


The following lemma is an analogue of Lemma 10; it lower bounds the number of dirty examples 
that are removed in the outlier removal step. 
Lemma 20 W.h.p. over the random draw of $S$, any time Algorithm $A_{\mathrm{mlcw}}$ executes the outlier
removal step it removes at least $\frac{m}{O(n)}$ noisy examples.


Proof. Since our ultimate goal is only to prove that the algorithm succeeds for some $\eta$ which is $o(\varepsilon)$,
we may assume without loss of generality that the original noise rate $\eta$ is less than $\varepsilon/4$. This means
that $\eta' \le 1/4$, and consequently a Chernoff bound gives that w.h.p. the fraction $\hat{\eta}'$ of noisy examples
in $S$ at the beginning of the weak learner's training is at most $1/2$. And Lemma 17 implies that for a
sufficiently large polynomial choice of $m$, we have that w.h.p. for all $a \in S^{n-1}$, the following holds
for all the clean examples in the data before any examples were removed:

\[ \sum_{(x,y)\in S_{\mathrm{clean}}}(a\cdot x)^2 \le c\,m\log^2(1/\varepsilon), \tag{6} \]

where $c$ is an absolute constant. We say that a random sample that meets all these requirements is
“reasonable.” We now set the constant $c_0$ that is used in the specification of $A_{\mathrm{mlcw}}$ to be $\sqrt{2(c+1)}$.
We will now show that, for any reasonable sample $S$, the number of noisy examples removed during
the first execution of the outlier removal step of $A_{\mathrm{mlcw}}$ is at least $\frac{m}{O(n)}$.

If we remove examples using direction $w$ then it means $\sum_{(x,y)\in S}(w\cdot x)^2 > c_0^2\,m\log^2(n/\varepsilon)$. Since $S$
is reasonable, by (6) the contribution to the sum from the clean examples that have survived until
this point is at most $c\,m\log^2(1/\varepsilon)$, so we must have

\[ \sum_{(x,y)\in S_{\mathrm{dirty}}}(w\cdot x)^2 > (c_0^2 - c)\,m\log^2(n/\varepsilon). \]





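Let $S_{\mathrm{dirty}} = N \cup F$ where $N$ is the examples $(x, y)$ for which $x$ satisfies $(w\cdot x)^2 \le c_0^2\log^2(n/\varepsilon)$ and $F$
is the other points. We have

\[ \sum_{(x,y)\in N}(w\cdot x)^2 \le c_0^2\,\hat{\eta}'\,m\log^2(n/\varepsilon), \]

and so, since $\|x\| \le \sqrt{3n\log m}$ implies that $(w\cdot x)^2 \le 3n\log m$ for all unit length $w$, we have

\[
\begin{aligned}
|F| &\ge \sum_{(x,y)\in F}\frac{(w\cdot x)^2}{3n\log m} \\
    &= \sum_{(x,y)\in S_{\mathrm{dirty}}}\frac{(w\cdot x)^2}{3n\log m} - \sum_{(x,y)\in N}\frac{(w\cdot x)^2}{3n\log m} \\
    &\ge \frac{(c_0^2 - c)\,m\log^2(n/\varepsilon) - c_0^2\,\hat{\eta}'\,m\log^2(n/\varepsilon)}{3n\log m} \\
    &\ge \frac{m\log^2(n/\varepsilon)}{3n\log m} \\
    &\ge \frac{m}{O(n)},
\end{aligned}
\]

where the next-to-last inequality uses $\hat{\eta}' \le 1/2$ and $c_0 = \sqrt{2(c+1)}$, and the final one uses $m =
O(\mathrm{poly}(n/\varepsilon))$. The points in $F$ are precisely the ones that are removed, and thus the lemma is
proved. □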
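For completeness, the constant in the next-to-last inequality of this proof can be checked directly. With $c_0^2 = 2(c+1)$ and a noisy fraction $\hat{\eta}' \le 1/2$ (the symbol $\hat{\eta}'$ denotes the fraction of noisy examples in $S$, as in the proof), the coefficient of $m\log^2(n/\varepsilon)$ is at least 1:

```latex
\[
(c_0^2 - c) - c_0^2\,\hat{\eta}'
  \;=\; c_0^2\,(1 - \hat{\eta}') - c
  \;\ge\; \frac{c_0^2}{2} - c
  \;=\; (c + 1) - c
  \;=\; 1 .
\]
```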
4.5 Proof of Lemma 14


We first note that Lemma 20 implies that w.h.p. the weak learner must terminate after at most O(n) 
iterations of outlier removal. 

Let $u$ be the unit length normal vector of the separating halfspace for the target function $f$.
Recall that we have assumed without loss of generality that $\|x\| \le \sqrt{3n\log m}$ for all $x$ in the training
set, so that $\|v\| \le \sqrt{3n\log m}$, and thus the advantage of $h$ with respect to $D'$ can be expressed as

\[ \mathbf{E}_{x\sim D'}[h(x) f(x)] = \frac{\mathbf{E}_{x\sim D'}[(v\cdot x) f(x)]}{3n\log m}, \tag{7} \]

and so we shall work on lower bounding $\mathbf{E}_{x\sim D'}[(v\cdot x) f(x)]$.
As in the proof of Lemma 12, let

• $S_{\mathrm{clean}}$ be all of the clean examples in the initial sample $S$, and $S'_{\mathrm{clean}}$ be those that are not
removed in any stage of outlier removal;

• $S_{\mathrm{dirty}}$ be all of the dirty examples in the initial sample $S$, and $S'_{\mathrm{dirty}}$ be those that are not removed
in any stage of outlier removal;

• $\eta' = \frac{|S'_{\mathrm{dirty}}|}{|S'_{\mathrm{clean}} \cup S'_{\mathrm{dirty}}|}$, that is, the noise rate among the examples that survive until the end of training
of the weak learner, and

• $\alpha = \frac{|S_{\mathrm{clean}}| - |S'_{\mathrm{clean}}|}{|S'_{\mathrm{clean}} \cup S'_{\mathrm{dirty}}|}$, the ratio of the number of clean points that were erroneously removed to the
size of the final surviving data set.


As before we write $S'$ for $S'_{\mathrm{clean}} \cup S'_{\mathrm{dirty}}$. Also as before, let $v_{\mathrm{clean}}$ be the average of all the clean
examples, $v_{\mathrm{dirty}}$ be the average of the dirty (noisy) examples that were not deleted, and $v_{\mathrm{del}}$ be the
average of the clean examples that were deleted. Then arguing exactly as before, we have

\[ v = (1 - \eta' + \alpha)\,v_{\mathrm{clean}} + \eta'\,v_{\mathrm{dirty}} - \alpha\,v_{\mathrm{del}}. \]
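The decomposition of $v$ is a bookkeeping identity about averages, and it can be sanity-checked numerically. The sketch below uses random vectors as stand-ins for the signed examples $y\,x$; the set sizes and variable names are arbitrary, hypothetical choices.

```python
import random

def mean_vec(vectors, n):
    """Coordinate-wise average of a non-empty list of n-dimensional vectors."""
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(n)]

rng = random.Random(1)
n = 3

def rand_vec():
    # Random stand-in for a signed example y * x.
    return [rng.gauss(0, 1) for _ in range(n)]

clean_kept = [rand_vec() for _ in range(50)]     # clean examples that survive
clean_deleted = [rand_vec() for _ in range(5)]   # clean examples removed as outliers
dirty_kept = [rand_vec() for _ in range(10)]     # noisy examples that survive

survivors = clean_kept + dirty_kept              # the final data set S'
eta = len(dirty_kept) / len(survivors)           # noise rate among survivors
alpha = len(clean_deleted) / len(survivors)      # removed clean / |S'|

v = mean_vec(survivors, n)
v_clean = mean_vec(clean_kept + clean_deleted, n)  # average over ALL clean examples
v_dirty = mean_vec(dirty_kept, n)
v_del = mean_vec(clean_deleted, n)

rhs = [(1 - eta + alpha) * v_clean[i] + eta * v_dirty[i] - alpha * v_del[i]
       for i in range(n)]
max_gap = max(abs(a - b) for a, b in zip(v, rhs))
print(max_gap)
```

The gap is zero up to floating-point rounding, because the sum over the survivors equals the sum over all clean examples, minus the deleted clean examples, plus the surviving noisy examples.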
The expectation of $v_{\mathrm{clean}}$ will play a special role in the analysis:

\[ v^*_{\mathrm{clean}} \stackrel{\mathrm{def}}{=} \mathbf{E}_{x\sim D'}[f(x)\,x]. \]


Once again, we will demonstrate the limited effect of $v_{\mathrm{dirty}}$ by bounding its length. This time,
the outlier removal enforces the fact that, for any $w \in S^{n-1}$, we have

\[ \sum_{(x,y)\in S'}(w\cdot x)^2 \le c_0^2\,m\log^2(n/\varepsilon). \]

Applying this for the unit vector $w$ in the direction of $v_{\mathrm{dirty}}$ as was done in Lemma 12, this implies

\[ \|v_{\mathrm{dirty}}\| \le c_0\log(n/\varepsilon)\sqrt{\frac{m}{|S'_{\mathrm{dirty}}|}}. \]


Next, let us apply this to bound an expression that captures the average harm done by $v_{\mathrm{dirty}}$:

\[
\big|\mathbf{E}_{x\sim D'}[f(x)(v_{\mathrm{dirty}}\cdot x)]\big| = \big|v_{\mathrm{dirty}}\cdot v^*_{\mathrm{clean}}\big|
\;\le\; c_0\log(n/\varepsilon)\sqrt{\frac{m}{|S'_{\mathrm{dirty}}|}}\;\|v^*_{\mathrm{clean}}\|. \tag{8}
\]


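To show that $v_{\mathrm{clean}}$ plays a relatively large role, it is helpful to lower bound the length of $v^*_{\mathrm{clean}}$.
We do this by lower bounding the length of its projection onto the unit normal vector $u$ of the target
as follows:

\[ v^*_{\mathrm{clean}}\cdot u = \mathbf{E}_{x\sim D'}[(f(x)x)\cdot u] = \mathbf{E}_{x\sim D'}[\mathrm{sgn}(u\cdot x)(x\cdot u)] = \mathbf{E}_{x\sim D'}[|x\cdot u|] \ge \varepsilon/8, \]

by Lemma 16. Since $u$ is unit length, this implies

\[ \|v^*_{\mathrm{clean}}\| \ge \varepsilon/8. \tag{9} \]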


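Armed with this bound, we can now lower bound the benefit imparted by $v_{\mathrm{clean}}$:

\[
\mathbf{E}_{z\sim D'}[f(z)(v_{\mathrm{clean}}\cdot z)]
 = \frac{1}{|S_{\mathrm{clean}}|}\sum_{(x,y)\in S_{\mathrm{clean}}}\mathbf{E}_{z\sim D'}[f(z)((yx)\cdot z)]
 = \frac{1}{|S_{\mathrm{clean}}|}\sum_{(x,y)\in S_{\mathrm{clean}}}(yx)\cdot v^*_{\mathrm{clean}}.
\]

Since $\mathbf{E}[(yx)\cdot v^*_{\mathrm{clean}}] = \|v^*_{\mathrm{clean}}\|^2$, and $(yx)\cdot v^*_{\mathrm{clean}} \in [-3n\log m,\, 3n\log m]$, a Hoeffding bound implies
that w.h.p.

\[ \mathbf{E}_{z\sim D'}[f(z)(v_{\mathrm{clean}}\cdot z)] \ge \|v^*_{\mathrm{clean}}\|^2 - O(n\log^{3/2} m)\big/\sqrt{|S_{\mathrm{clean}}|}. \]

Since the noise rate $\eta'$ is at most $\eta/\varepsilon$ and $\eta$ is certainly less than $\varepsilon/4$ as discussed above, another
Hoeffding bound gives that w.h.p. $|S_{\mathrm{clean}}|$ is at least $m/2$; thus for a suitably large polynomial choice
of $m$, using (9) we have

\[ \mathbf{E}_{z\sim D'}[f(z)(v_{\mathrm{clean}}\cdot z)] \ge \|v^*_{\mathrm{clean}}\|^2 - O(n\log^{3/2} m)\big/\sqrt{m} \ge \frac{\|v^*_{\mathrm{clean}}\|^2}{2}. \tag{10} \]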
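Now we are ready to put our bounds together and lower bound the advantage of $v$. We have

\[
\mathbf{E}_{x\sim D'}[f(x)(v\cdot x)] = (1 - \eta' + \alpha)\,\mathbf{E}[f(x)(v_{\mathrm{clean}}\cdot x)]
 + \eta'\,\mathbf{E}[f(x)(v_{\mathrm{dirty}}\cdot x)] - \alpha\,\mathbf{E}[f(x)(v_{\mathrm{del}}\cdot x)].
\]

We bound each of the three contributions in turn. First, using $1 - \eta' \ge 1/2$ and (10), we have

\[ (1 - \eta' + \alpha)\,\mathbf{E}[f(x)(v_{\mathrm{clean}}\cdot x)] \ge \frac{\|v^*_{\mathrm{clean}}\|^2}{4}. \]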
Next, by (8), we have

\[ \eta'\,\big|\mathbf{E}_{x\sim D'}[f(x)(v_{\mathrm{dirty}}\cdot x)]\big| \le c_0\log(n/\varepsilon)\sqrt{2\eta'}\;\|v^*_{\mathrm{clean}}\|. \]

Since we may assume that $\eta' \le c'\varepsilon^2/\log^2(n/\varepsilon)$ for as small a fixed constant $c'$ as we like (recall the
overall bound of Theorem 2), we get

\[ c_0\log(n/\varepsilon)\sqrt{2\eta'}\;\|v^*_{\mathrm{clean}}\| \le (\varepsilon/64)\,\|v^*_{\mathrm{clean}}\| \]

(for a suitably small constant choice of $c'$), and this is less than $\frac{\|v^*_{\mathrm{clean}}\|^2}{8}$ since $\|v^*_{\mathrm{clean}}\| \ge \varepsilon/8$.
Finally, Corollary 19, together with the fact that there are at most $O(n)$ iterations of outlier
removal and the final surviving data set is of size at least $m/4$, gives us that $\alpha \le O(\varepsilon^3/n^3)$. This
(recalling that both $v_{\mathrm{del}}$ and all $x$ in the support of $D'$ have norm at most $\sqrt{3n\log m}$) means that
$|\alpha\,\mathbf{E}[f(x)(v_{\mathrm{del}}\cdot x)]| = o(\varepsilon^2)$.
Combining all these bounds, we get

\[ \mathbf{E}_{x\sim D'}[f(x)(v\cdot x)] \ge \frac{\|v^*_{\mathrm{clean}}\|^2}{4} - \frac{\|v^*_{\mathrm{clean}}\|^2}{8} - o(\varepsilon^2) = \Omega(\varepsilon^2) \]

by (9). Together with (7), the proof of Lemma 14 is completed.





5. Learning Under Isotropic Log-concave Distributions with Adversarial Label Noise 


In this section, we consider the model where an adversary can change some class labels, but cannot 
otherwise modify examples. 


5.1 The Model 


We now define the model of learning with adversarial label noise under isotropic log-concave
distributions. In this model the learning algorithm has access to an oracle that provides independent
random examples drawn according to a fixed distribution $P$ on $\mathbb{R}^n \times \{-1,1\}$, where

• the marginal distribution over $\mathbb{R}^n$ is isotropic log-concave, and
• there is a halfspace $f$ such that $\Pr_{(x,y)\sim P}[f(x) \ne y] \le \eta$.

The parameter $\eta$ is the noise rate. As usual, the goal of the learner is to output a hypothesis $h$
such that $\Pr_{(x,y)\sim P}[h(x) \ne y] \le \varepsilon$; if an algorithm achieves this goal, we say it learns to accuracy
$1 - \varepsilon$ in the presence of adversarial label noise at rate $\eta$.


5.2 The Algorithm 


Like the algorithm $A_{\mathrm{mlc}}$ considered in the last section, the algorithm $A_{\mathrm{alc}}$ studied in this section
applies the smooth boosting algorithm of Lemma 13 to a weak learner that performs averaging. The
weak learner $A_{\mathrm{alcw}}$ behaves as follows:

Algorithm $A_{\mathrm{alcw}}$:
1. Draw a set $S$ of $m$ examples according to $P'$ (the oracle for a modified distribution provided
by the boosting algorithm).
2. Remove all examples $(x, y)$ such that $\|x\| > \sqrt{3n\log m}$ from $S$.
3. Let $v = \frac{1}{|S|}\sum_{(x,y)\in S} y\,x$. Return the confidence-rated classifier $h$ defined by $h(x) = \frac{v\cdot x}{3n\log m}$ if
$|v\cdot x| \le 3n\log m$, and $h(x) = \mathrm{sgn}(v\cdot x)$ otherwise.
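The averaging weak learner is simple enough to sketch directly. The code below is an illustrative sketch rather than the authors' implementation; the function name `weak_learner_alcw` and the four-example sample are hypothetical.

```python
import math

def weak_learner_alcw(sample):
    """sample: list of (x, y) with x a tuple of floats and y in {-1, +1}."""
    n, m = len(sample[0][0]), len(sample)
    bound = 3 * n * math.log(m)  # the quantity 3 n log m
    # Step 2: drop examples with ||x|| > sqrt(3 n log m).
    S = [(x, y) for x, y in sample if sum(t * t for t in x) <= bound]
    if not S:
        raise ValueError("every example was removed by the norm filter")
    # Step 3: average the signed examples y * x.
    v = [sum(y * x[i] for x, y in S) / len(S) for i in range(n)]

    def h(x):
        t = sum(a * b for a, b in zip(v, x))
        # Confidence-rated output in [-1, 1].
        return t / bound if abs(t) <= bound else math.copysign(1.0, t)

    return h, v

# Tiny hypothetical sample: the last point is far outside the norm bound.
sample = [((1.0, 0.0), 1), ((-1.0, 0.0), -1), ((0.0, 2.0), 1), ((100.0, 0.0), -1)]
h, v = weak_learner_alcw(sample)
print(v, h((1.0, 0.0)), h((100.0, 0.0)))
```

Here the far-away example is discarded by the norm filter, so `v` is the average of the three remaining signed examples, namely (2/3, 2/3).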











5.3 Claim About the Weak Learner 


As in the previous section, the heart of our analysis will be to analyze the weak learner. We omit 
discussing the application of the smooth boosting algorithm here, as it is nearly identical to Section 
4. 
Lemma 21 Suppose Algorithm $A_{\mathrm{alcw}}$ is run using $P'$ as the source of labeled examples, where $P'$
is a distribution that is $(1/\varepsilon)$-smooth with respect to a joint distribution $P$ on $\mathbb{R}^n \times \{-1,1\}$ whose
marginal $D$ on $\mathbb{R}^n$ is isotropic and log-concave. Further, assume there exists a linear threshold
function $f$ such that $\Pr_{(x,y)\sim P'}[f(x) \ne y] \le \eta/\varepsilon$ and $\eta \le \Omega\!\left(\frac{\varepsilon^3}{\log(1/\varepsilon)}\right)$. Then with high probability,
$A_{\mathrm{alcw}}$ outputs a hypothesis with advantage $\Omega\!\left(\frac{\varepsilon^2}{n\log m}\right)$.


5.4 Lemmas in Support of Lemma 21 


During this section, let us focus our attention on a single call to the weak learner. Let $P'$ be a
distribution as in Lemma 21 and let $D'$ be the marginal on $\mathbb{R}^n$. We observe that since $P'$ is
$(1/\varepsilon)$-smooth with respect to $P$, the marginal $D'$ of $P'$ is $(1/\varepsilon)$-smooth with respect to the marginal $D$ of
$P$.

As in Section 4, we may assume that the support of $D'$ lies entirely on $x$ such that $\|x\| \le
\sqrt{3n\log m}$ (this negligibly affects the final bounds obtained in our analyses).

The following technical lemma will be used to limit the extent to which the distribution P' can 
concentrate a lot of noise in one direction. 


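Lemma 22 Let $E$ be any event with positive probability under $D'$, and let $\kappa = D'(E)$. For any unit
length $a \in \mathbb{R}^n$, $\mathbf{E}_{x\sim D'}\big[|a\cdot x| \;\big|\; E\big] = O\!\left(\log\frac{1}{\varepsilon\kappa}\right)$.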


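Proof. Let $\beta$ be such that $\Pr_{x\sim D'}[|a\cdot x| > \beta] = \kappa$. By Lemmas 4 and 5, together with the fact that $D'$
is $(1/\varepsilon)$-smooth with respect to $D$, we have

\[ \kappa \le \frac{1}{\varepsilon}\,e^{1-\beta}, \]

which implies $\beta \le 1 + \ln\!\left(\frac{1}{\varepsilon\kappa}\right)$.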

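Let $F$ be the event that $|a\cdot x| > \beta$. We will show that $\mathbf{E}_{x\sim D'}[|a\cdot x| \mid E] \le \mathbf{E}_{x\sim D'}[|a\cdot x| \mid F]$, and
then bound $\mathbf{E}_{x\sim D'}[|a\cdot x| \mid F]$. If $\Pr[(E - F)\cup(F - E)] = 0$, then, obviously, $\mathbf{E}_{x\sim D'}[|a\cdot x| \mid E] =
\mathbf{E}_{x\sim D'}[|a\cdot x| \mid F]$. Suppose $\Pr[(E - F)\cup(F - E)] > 0$. Then, writing all probabilities as
conditional probabilities under $D'$,

\[
\begin{aligned}
\mathbf{E}_{x\sim D'}[|a\cdot x| \mid E]
 &= \mathbf{E}_{x\sim D'}[|a\cdot x| \mid E\cap F]\,\Pr[E\cap F \mid E] + \mathbf{E}_{x\sim D'}[|a\cdot x| \mid E - F]\,\Pr[E - F \mid E] \\
 &= \mathbf{E}_{x\sim D'}[|a\cdot x| \mid E\cap F]\,\Pr[E\cap F \mid F] + \mathbf{E}_{x\sim D'}[|a\cdot x| \mid E - F]\,\Pr[F - E \mid F] \\
 &\qquad\text{(because } \Pr[E] = \Pr[F]\text{)} \\
 &\le \mathbf{E}_{x\sim D'}[|a\cdot x| \mid E\cap F]\,\Pr[E\cap F \mid F] + \mathbf{E}_{x\sim D'}[|a\cdot x| \mid F - E]\,\Pr[F - E \mid F],
\end{aligned}
\]

because for every $x \in E - F$ and every $x' \in F - E$,
\[ |a\cdot x| \le \beta < |a\cdot x'|. \]
But
\[ \mathbf{E}_{x\sim D'}[|a\cdot x| \mid E\cap F]\,\Pr[E\cap F \mid F] + \mathbf{E}_{x\sim D'}[|a\cdot x| \mid F - E]\,\Pr[F - E \mid F] = \mathbf{E}_{x\sim D'}[|a\cdot x| \mid F], \]
so
\[ \mathbf{E}_{x\sim D'}[|a\cdot x| \mid E] \le \mathbf{E}_{x\sim D'}[|a\cdot x| \mid F]. \tag{11} \]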
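Now, setting $b = \lfloor\beta\rfloor$, we have

\[
\begin{aligned}
\mathbf{E}_{x\sim D'}[|a\cdot x| \mid F]
 &\le \frac{1}{D'(F)}\sum_{i=b}^{\infty}(i+1)\Pr_{x\sim D'}\big[|a\cdot x| \in (i, i+1]\big] \\
 &\le \frac{1}{D'(F)}\sum_{i=b}^{\infty}(i+1)\,\frac{e^{-i+1}}{\varepsilon} \\
 &\le \frac{1}{D'(F)}\cdot O\!\left(\frac{b\,e^{-b}}{\varepsilon}\right) \\
 &= O(b),
\end{aligned}
\]

since $D'(F) = \Theta(e^{-b}/\varepsilon)$. Combining with (11) completes the proof. □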


5.5 Proof of Lemma 21 


Fix some halfspace $f$ such that $\Pr_{(x,y)\sim P}[f(x) \ne y] \le \eta$, and let $u$ be the unit normal vector of its
separating hyperplane.
Let $P'$ be the joint distribution given to $A_{\mathrm{alcw}}$ and let $D'$ be its marginal on $\mathbb{R}^n$. As noted in the
previous subsection, $D'$ is $(1/\varepsilon)$-smooth with respect to the original marginal distribution $D$ of $P$.
First, we bound the advantage of the hypothesis $h$ with respect to $P'$ in terms of the tendency of
$h$ to agree with the best linear function $f$:

\[ \mathbf{E}_{(x,y)\sim P'}[h(x)\,y] \;\ge\; \mathbf{E}_{(x,y)\sim P'}[h(x) f(x)] - 2\eta' \;=\; \mathbf{E}_{x\sim D'}[h(x) f(x)] - 2\eta'. \tag{12} \]


Furthermore, as we have assumed without loss of generality that $\|x\| \le \sqrt{3n\log m}$ for all examples
in the training set, and therefore that $\|v\| \le \sqrt{3n\log m}$, we have

\[ \mathbf{E}_{x\sim D'}[h(x) f(x)] = \frac{1}{3n\log m}\,\mathbf{E}_{x\sim D'}[f(x)(x\cdot v)], \tag{13} \]

so we will work on bounding $\mathbf{E}_{x\sim D'}[f(x)(x\cdot v)]$.


Let $P'_{\mathrm{correct}}$ be obtained by conditioning a random draw $(x, y)$ from $P'$ on the event that $f(x) = y$.
Define $P'_{\mathrm{dirty}}$ analogously, and let $D'_{\mathrm{correct}}$ and $D'_{\mathrm{dirty}}$ be the corresponding marginals on $\mathbb{R}^n$. Let
\[ v^*_{\mathrm{dirty}} = \mathbf{E}_{(x,y)\sim P'_{\mathrm{dirty}}}[y\,x], \qquad v^*_{\mathrm{correct}} = \mathbf{E}_{x\sim D'}[f(x)\,x]. \]
Note that the linearity of expectation implies that
\[ \mathbf{E}_{x\sim D'}[f(x)(x\cdot v)] = \big(\mathbf{E}_{x\sim D'}[f(x)\,x]\big)\cdot v = v^*_{\mathrm{correct}}\cdot v = \frac{1}{|S|}\sum_{(x,y)\in S} v^*_{\mathrm{correct}}\cdot(yx). \tag{14} \]

Equation (14) expresses $\mathbf{E}_{x\sim D'}[f(x)(x\cdot v)]$, which is closely related to the advantage of $h$ through
(13) and (12), as a sum of independent random variables, one for each example. We will bound
$\mathbf{E}_{x\sim D'}[f(x)(x\cdot v)]$ by bounding the expected effect of a random example on its value, and applying
a Hoeffding bound.
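Let $\eta' = \Pr_{(x,y)\sim P'}[f(x) \ne y]$. Since $P'$ is $(1/\varepsilon)$-smooth with respect to $P$, we have $\eta' \le \eta/\varepsilon$. We
can rearrange the effect of a random example as follows:

\[
\begin{aligned}
\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(yx)]
 &= (1-\eta')\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x) \mid y = f(x)] \\
 &\quad + \eta'\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(-f(x)x) \mid y \ne f(x)] \\
 &= (1-\eta')\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x) \mid y = f(x)] \\
 &\quad + \eta'\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x) \mid y \ne f(x)] \\
 &\quad - \eta'\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x) \mid y \ne f(x)] \\
 &\quad + \eta'\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(-f(x)x) \mid y \ne f(x)].
\end{aligned}
\tag{15}
\]

Since
\[
\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x)]
 = \eta'\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x) \mid y \ne f(x)] + (1-\eta')\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x) \mid y = f(x)],
\]
by replacing the first two terms of (15) with $\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x)]$, we get

\[
\begin{aligned}
\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(yx)]
 &= \mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x)] \\
 &\quad - \eta'\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x) \mid y \ne f(x)] \\
 &\quad + \eta'\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(-f(x)x) \mid y \ne f(x)] \\
 &= \mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x)] - 2\eta'\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x) \mid y \ne f(x)].
\end{aligned}
\]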


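Twice applying the linearity of expectation, and using $y = -f(x)$ whenever $y \ne f(x)$, we get

\[
\begin{aligned}
\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(yx)]
 &= \|v^*_{\mathrm{correct}}\|^2 - 2\eta'\,\mathbf{E}_{(x,y)\sim P'}[v^*_{\mathrm{correct}}\cdot(f(x)x) \mid y \ne f(x)] \\
 &= \|v^*_{\mathrm{correct}}\|^2 + 2\eta'\,v^*_{\mathrm{correct}}\cdot v^*_{\mathrm{dirty}} \\
 &\ge \|v^*_{\mathrm{correct}}\|^2 - 2\eta'\,\|v^*_{\mathrm{correct}}\|\cdot\|v^*_{\mathrm{dirty}}\| \\
 &\ge \frac{1}{2}\Big(\|v^*_{\mathrm{correct}}\|^2 - 4(\eta')^2\|v^*_{\mathrm{dirty}}\|^2\Big).
\end{aligned}
\]

The last line follows from the fact that $q^2 - qr \ge (q^2 - r^2)/2$ for all real $q, r$.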
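The elementary inequality used in the last step can be verified by completing the square; in the derivation above it is applied with $q = \|v^*_{\mathrm{correct}}\|$ and $r = 2\eta'\|v^*_{\mathrm{dirty}}\|$ (this choice of $q$ and $r$ is our reading of the step):

```latex
\[
q^2 - qr - \frac{q^2 - r^2}{2}
  \;=\; \frac{q^2 - 2qr + r^2}{2}
  \;=\; \frac{(q - r)^2}{2}
  \;\ge\; 0
  \qquad\text{for all real } q, r .
\]
```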

So now our goals are a lower bound on $\|v^*_{\mathrm{correct}}\|$ and an upper bound on $\|v^*_{\mathrm{dirty}}\|$.

We can lower bound $\|v^*_{\mathrm{correct}}\|$ essentially the same way we did before, by lower bounding its
projection onto the “target” normal vector $u$:

\[ v^*_{\mathrm{correct}}\cdot u = \mathbf{E}_{(x,y)\sim P'}[(f(x)x)\cdot u] = \mathbf{E}_{(x,y)\sim P'}[\mathrm{sgn}(u\cdot x)(x\cdot u)] = \mathbf{E}_{(x,y)\sim P'}[|x\cdot u|] \ge \varepsilon/16, \tag{16} \]

by Lemma 16.
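We upper bound $\|v^*_{\mathrm{dirty}}\|$ as follows:

\[
\begin{aligned}
\|v^*_{\mathrm{dirty}}\|^2 &= v^*_{\mathrm{dirty}}\cdot\mathbf{E}_{(x,y)\sim P'_{\mathrm{dirty}}}[y\,x] \\
 &= \|v^*_{\mathrm{dirty}}\|\;\mathbf{E}_{(x,y)\sim P'_{\mathrm{dirty}}}\!\left[y\left(\frac{v^*_{\mathrm{dirty}}}{\|v^*_{\mathrm{dirty}}\|}\cdot x\right)\right] \\
 &\le \|v^*_{\mathrm{dirty}}\|\;\mathbf{E}_{x\sim D'_{\mathrm{dirty}}}\!\left[\left|\frac{v^*_{\mathrm{dirty}}}{\|v^*_{\mathrm{dirty}}\|}\cdot x\right|\right] \\
 &\le \|v^*_{\mathrm{dirty}}\|\;O\big(\log(1/(\eta'\varepsilon))\big)
\end{aligned}
\]

by Lemma 22. Thus $\|v^*_{\mathrm{dirty}}\| \le O(\log(1/(\eta'\varepsilon)))$.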
Combining this with (16) and (14) we have that if

\[ \eta'\log(1/(\eta'\varepsilon)) \le c\,\varepsilon \]

for a suitably small constant $c$, then $\mathbf{E}_{x\sim D'}[f(x)(x\cdot v)]$ is an average of $m$ i.i.d. random variables, each
with mean at least $\Omega(\varepsilon^2)$, and coming from an interval of length $O(n\log m)$. Applying the standard
Hoeffding bound, polynomially many examples suffice for $\mathbf{E}_{x\sim D'}[f(x)(x\cdot v)] \ge \Omega(\varepsilon^2)$. Combining
with (13) and (12) completes the proof.


6. Conclusion 


Our algorithms use boosting together with a confidence-rated weak learner that performs a simple
averaging of labeled examples. As shown in earlier work (Servedio, 2002, 2003) there are close 
connections between such an approach and the Perceptron algorithm. It seems likely that the Per- 
ceptron could be used as an alternative to boosting and averaging in our algorithms; it would be 
interesting to see if a Perceptron-based approach has any theoretical or empirical advantages over 
the algorithms we give in this paper. 

More generally, there are relatively few algorithms for learning interesting classes of functions 
in the presence of malicious noise. We hope that our results will help lead to the development of 
more efficient algorithms for this challenging noise model. 

As a challenge for future work, we pose the following question: do there exist computationally 
efficient algorithms for learning halfspaces under arbitrary distributions in the presence of malicious 
noise? As of now no better results are known for this problem than the generic conversions of Kearns 
and Li (1993), which can be applied to any concept class. We feel that even a small improvement in 
the malicious noise rate that can be handled for halfspaces would be a very interesting result. 
Appendix A. Proof of Lemma 7


Let us start with a couple of definitions and a couple of bounds from the literature. 
Definition 23 (VC-dimension) A set $F$ of $\{-1,1\}$-valued functions defined on a common domain
$X$ shatters $x_1, \ldots, x_d$ if every sequence $y_1, \ldots, y_d \in \{-1,1\}$ of function values has a function $f \in F$ such
that $f(x_1) = y_1, \ldots, f(x_d) = y_d$. The VC-dimension of $F$ is the size of the largest set shattered by $F$.


Definition 24 (pseudo-dimension) For a set $F$ of real-valued functions defined on a common domain
$X$, the pseudo-dimension of $F$ is the VC-dimension of $\{\mathrm{sign}(f(\cdot) - \theta) : f \in F,\ \theta \in \mathbb{R}\}$.


Lemma 25 (Pollard 1984; Talagrand 1994) Let $F$ be a set of real-valued functions defined on a
common domain $X$ taking values in $[0,1]$, and let $d$ be the pseudo-dimension of $F$. Let $D$ be a
probability distribution over $X$. Then if $x_1, \ldots, x_m$ are obtained by drawing $m$ times independently
according to $D$, for any $\delta > 0$,

\[ \Pr\left[\exists f \in F,\ \frac{1}{m}\sum_{t=1}^{m} f(x_t) > \mathbf{E}_{x\sim D}[f(x)] + c\sqrt{\frac{d + \log(1/\delta)}{m}}\right] \le \delta, \]

where $c > 0$ is an absolute constant.








Lemma 26 (see Blumer et al. 1989) The VC-dimension of unions of two halfspaces is O(n). 


Now, let us bound the pseudo-dimension of the class of functions that we need. 


Lemma 27 Let $F_n$ consist of the functions $f$ from $\mathbb{R}^n$ to $\mathbb{R}$ which can be defined by $f(x) = (a\cdot x)^2$
for some $a \in \mathbb{R}^n$. The pseudo-dimension of $F_n$ is at most $O(n)$.


Proof. According to the definition, the pseudo-dimension of $F_n$ is the VC-dimension of the set $G_n$
of $\{-1,1\}$-valued functions $g_{a,\theta}$ defined by $g_{a,\theta}(x) = \mathrm{sign}((a\cdot x)^2 - \theta)$. Each $g_{a,\theta}$ is equivalent to
an OR of two halfspaces:

\[ a\cdot x > \sqrt{\theta} \quad\text{OR}\quad (-a)\cdot x > \sqrt{\theta}. \]


Thus the VC-dimension of $G_n$ is at most the VC-dimension of the class of all ORs of two halfspaces.
Applying Lemma 26 completes the proof. □
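The equivalence used in this proof can be checked exhaustively on a small integer grid. The sketch below is only a sanity check under stated assumptions: it restricts $\theta$ to perfect squares so that all arithmetic is exact, it treats $\mathrm{sign}(0)$ as $-1$ (a tie-breaking convention assumed here), and the function names are hypothetical.

```python
import math
from itertools import product

def g(a, theta, x):
    """sign((a . x)^2 - theta) as a {-1, 1}-valued function (sign(0) taken as -1)."""
    t = sum(ai * xi for ai, xi in zip(a, x))
    return 1 if t * t - theta > 0 else -1

def or_of_halfspaces(a, theta, x):
    """+1 iff a . x > sqrt(theta) or (-a) . x > sqrt(theta)."""
    r = math.isqrt(theta)  # exact, because theta is a perfect square below
    t = sum(ai * xi for ai, xi in zip(a, x))
    return 1 if (t > r or -t > r) else -1

grid = list(product(range(-3, 4), repeat=2))
agree = all(
    g(a, theta, x) == or_of_halfspaces(a, theta, x)
    for a in grid for x in grid for theta in (0, 1, 4, 9, 16)
)
print(agree)  # True
```

For $\theta < 0$ the function $g_{a,\theta}$ is identically $+1$, so only nonnegative thresholds are of interest in this check.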


Applying Lemmas 25 and 27, we obtain Lemma 7. 


Appendix B. Proof of Lemma 8 


We will use the following, which strengthens bounds like Lemma 25 when the expectations being 
estimated are small. It differs from most bounds of this type by providing an especially strong bound 
on the probability that the estimates are much larger than the true expectations. 


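Lemma 28 (Bshouty et al. 2009) Suppose $F$ is a set of $\{0,1\}$-valued functions with a common
domain $X$. Let $d$ be the VC-dimension of $F$. Let $D$ be a probability distribution over $X$. Choose
$\alpha > 0$ and $K \ge 4$. Then if

\[ m \ge \frac{c}{\alpha K\log K}\left(d\log\frac{1}{\alpha} + \log\frac{1}{\delta}\right), \]

where $c$ is an absolute constant, then

\[ \Pr_{u\sim D^m}\big[\exists f \in F,\ \mathbf{E}_D(f) \le \alpha \text{ but } \hat{\mathbf{E}}_u(f) > K\alpha\big] \le \delta, \]

where $\hat{\mathbf{E}}_u(f) = \frac{1}{m}\sum_{i=1}^{m} f(u_i)$.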


To prove Lemma 8, we first use the fact that, for any fixed $a \in S^{n-1}$ and $\beta > 0$, it is known (see
Kalai et al. 2008) that
\[ \Pr_{x\in S^{n-1}}\big[|a\cdot x| > \beta\big] \le e^{-\beta^2 n/2}. \]
Further, as in the proof of Lemma 7, we have that
\[ |a\cdot x| > \beta \quad\text{if and only if}\quad a\cdot x > \beta \ \text{ OR } \ (-a)\cdot x > \beta, \]
so that the set of events whose probabilities we need to estimate is contained in the set of unions of
pairs of halfspaces. Applying Lemma 26 and Lemma 28 completes the proof.